rm(list = ls())
library(tidyverse)
library(tidymodels)
library(readr)
library(tswge)
library(ggplot2)
library(yardstick)
acfdf <- function(vec) {
vacf <- acf(vec, plot = F)
with(vacf, data.frame(lag, acf))
}
ggacf <- function(vec) {
ac <- acfdf(vec)
ggplot(data = ac, aes(x = lag, y = acf)) + geom_hline(aes(yintercept = 0)) +
geom_segment(mapping = aes(xend = lag, yend = 0))
}
tplot <- function(vec) {
df <- data.frame(X = vec, t = seq_along(vec))
ggplot(data = df, aes(x = t, y = X)) + geom_line()
}Biostat 212a Homework 6
Due Mar 22, 2024 @ 11:59PM
Load R libraries.
1 New York Stock Exchange (NYSE) data (1962-1986) (140 pts)
The NYSE.csv file contains three daily time series from the New York Stock Exchange (NYSE) for the period Dec 3, 1962-Dec 31, 1986 (6,051 trading days).
Log trading volume(\(v_t\)): This is the fraction of all outstanding shares that are traded on that day, relative to a 100-day moving average of past turnover, on the log scale.Dow Jones return(\(r_t\)): This is the difference between the log of the Dow Jones Industrial Index on consecutive trading days.Log volatility(\(z_t\)): This is based on the absolute values of daily price movements.
# Read in NYSE data from url
url = "https://raw.githubusercontent.com/ucla-biostat-212a/2024winter/master/slides/data/NYSE.csv"
NYSE <- read_csv(url)
NYSE# A tibble: 6,051 × 6
date day_of_week DJ_return log_volume log_volatility train
<date> <chr> <dbl> <dbl> <dbl> <lgl>
1 1962-12-03 mon -0.00446 0.0326 -13.1 TRUE
2 1962-12-04 tues 0.00781 0.346 -11.7 TRUE
3 1962-12-05 wed 0.00384 0.525 -11.7 TRUE
4 1962-12-06 thur -0.00346 0.210 -11.6 TRUE
5 1962-12-07 fri 0.000568 0.0442 -11.7 TRUE
6 1962-12-10 mon -0.0108 0.133 -10.9 TRUE
7 1962-12-11 tues 0.000124 -0.0115 -11.0 TRUE
8 1962-12-12 wed 0.00336 0.00161 -11.0 TRUE
9 1962-12-13 thur -0.00330 -0.106 -11.0 TRUE
10 1962-12-14 fri 0.00447 -0.138 -11.0 TRUE
# ℹ 6,041 more rows
The autocorrelation at lag \(\ell\) is the correlation of all pairs \((v_t, v_{t-\ell})\) that are \(\ell\) trading days apart. These sizable correlations give us confidence that past values will be helpful in predicting the future.
Code
ggacf(NYSE$log_volume) + ggthemes::theme_few()Do a similar plot for (1) the correlation between \(v_t\) and lag \(\ell\) Dow Jones return \(r_{t-\ell}\) and (2) correlation between \(v_t\) and lag \(\ell\) Log volatility \(z_{t-\ell}\).
Code
seq(1, 30) %>%
map(function(x) {cor(NYSE$log_volume , lag(NYSE$DJ_return, x), use = "pairwise.complete.obs")}) %>%
unlist() %>%
tibble(lag = 1:30, cor = .) %>%
ggplot(aes(x = lag, y = cor)) +
geom_hline(aes(yintercept = 0)) +
geom_segment(mapping = aes(xend = lag, yend = 0)) +
ggtitle("AutoCorrelation between `log volume` and lagged `DJ return`")Code
seq(1, 30) %>%
map(function(x) {cor(NYSE$log_volume , lag(NYSE$log_volatility, x), use = "pairwise.complete.obs")}) %>%
unlist() %>%
tibble(lag = 1:30, cor = .) %>%
ggplot(aes(x = lag, y = cor)) +
geom_hline(aes(yintercept = 0)) +
geom_segment(mapping = aes(xend = lag, yend = 0)) +
ggtitle("AutoCorrelation between `log volume` and lagged `log volatility`")1.1 Project goal
Our goal is to forecast daily Log trading volume, using various machine learning algorithms we learnt in this class.
The data set is already split into train (before Jan 1st, 1980, \(n_{\text{train}} = 4,281\)) and test (after Jan 1st, 1980, \(n_{\text{test}} = 1,770\)) sets.
In general, we will tune the lag \(L\) to acheive best forecasting performance. In this project, we would fix \(L=5\). That is we always use the previous five trading days’ data to forecast today’s log trading volume.
Pay attention to the nuance of splitting time series data for cross validation. Study and use the time-series functionality in tidymodels. Make sure to use the same splits when tuning different machine learning algorithms.
Use the \(R^2\) between forecast and actual values as the cross validation and test evaluation criterion.
1.2 Baseline method (20 pts)
We use the straw man (use yesterday’s value of log trading volume to predict that of today) as the baseline method. Evaluate the \(R^2\) of this method on the test data.
Before we tune different machine learning methods, let’s first separate into the test and non-test sets. We drop the first 5 trading days which lack some lagged variables.
# Lag: look back L trading days
# Do not need to include, as we included them in receipe
L = 5
for(i in seq(1, L)) {
NYSE <- NYSE %>%
mutate(!!paste("DJ_return_lag", i, sep = "") := lag(NYSE$DJ_return, i),
!!paste("log_volume_lag", i, sep = "") := lag(NYSE$log_volume, i),
!!paste("log_volatility_lag", i, sep = "") := lag(NYSE$log_volatility, i))
}
NYSE <- NYSE %>% na.omit()# Drop beginning trading days which lack some lagged variables
NYSE_other <- NYSE %>%
filter(train == 'TRUE') %>%
select(-train) %>%
drop_na()
dim(NYSE_other)[1] 4276 20
NYSE_test = NYSE %>%
filter(train == 'FALSE') %>%
select(-train) %>%
drop_na()
dim(NYSE_test)[1] 1770 20
library(yardstick)
# cor(NYSE_test$log_volume, NYSE_test$log_volume_lag1) %>% round(2)
r2_test_strawman = rsq_vec(NYSE_test$log_volume, lag(NYSE_test$log_volume, 1)) %>% round(2)
print(paste("Straw man test R2: ", r2_test_strawman))[1] "Straw man test R2: 0.35"
1.3 Autoregression (AR) forecaster (30 pts)
Let \[ y = \begin{pmatrix} v_{L+1} \\ v_{L+2} \\ v_{L+3} \\ \vdots \\ v_T \end{pmatrix}, \quad M = \begin{pmatrix} 1 & v_L & v_{L-1} & \cdots & v_1 \\ 1 & v_{L+1} & v_{L} & \cdots & v_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & v_{T-1} & v_{T-2} & \cdots & v_{T-L} \end{pmatrix}. \]
Fit an ordinary least squares (OLS) regression of \(y\) on \(M\), giving \[ \hat v_t = \hat \beta_0 + \hat \beta_1 v_{t-1} + \hat \beta_2 v_{t-2} + \cdots + \hat \beta_L v_{t-L}, \] known as an order-\(L\) autoregression model or AR(\(L\)).
Before we start the model training, let’s talk about time series resampling. We will use the
rolling_originfunction in thersamplepackage to create a time series cross-validation plan.When the data have a strong time component, a resampling method should support modeling to estimate seasonal and other temporal trends within the data. A technique that randomly samples values from the training set can disrupt the model’s ability to estimate these patterns.
NYSE %>%
ggplot(aes(x = date, y = log_volume)) +
geom_line() +
geom_smooth(method = "lm")wrong_split <- initial_split(NYSE_other)
bind_rows(
training(wrong_split) %>% mutate(type = "train"),
testing(wrong_split) %>% mutate(type = "test")
) %>%
ggplot(aes(x = date, y = log_volume, color = type, group = NA)) +
geom_line()correct_split <- initial_time_split(NYSE_other %>% arrange(date))
bind_rows(
training(correct_split) %>% mutate(type = "train"),
testing(correct_split) %>% mutate(type = "test")
) %>%
ggplot(aes(x = date, y = log_volume, color = type, group = NA)) +
geom_line()rolling_origin(NYSE_other %>% arrange(date), initial = 30, assess = 7) %>%
#sliding_period(NYSE_other %>% arrange(date), date, period = "day", lookback = Inf, assess_stop = 1) %>%
mutate(train_data = map(splits, analysis),
test_data = map(splits, assessment)) %>%
select(-splits) %>%
pivot_longer(-id) %>%
filter(id %in% c("Slice0001", "Slice0002", "Slice0003")) %>%
unnest(value) %>%
ggplot(aes(x = date, y = log_volume, color = name, group = NA)) +
geom_point() +
geom_line() +
facet_wrap(~id, scales = "fixed")sliding_period(NYSE_other %>% arrange(date),
date, period = "month", lookback = Inf, assess_stop = 1) %>%
mutate(train_data = map(splits, analysis),
test_data = map(splits, assessment)) %>%
select(-splits) %>%
pivot_longer(-id) %>%
filter(id %in% c("Slice001", "Slice002", "Slice003")) %>%
unnest(value) %>%
ggplot(aes(x = date, y = log_volume, color = name, group = NA)) +
geom_point() +
geom_line() +
facet_wrap(~id, scales = "fixed")Rolling forecast origin resampling (Hyndman and Athanasopoulos 2018) provides a method that emulates how time series data is often partitioned in practice, estimating the model with historical data and evaluating it with the most recent data.
Tune AR(5) with elastic net (lasso + ridge) regularization using all 3 features on the training data, and evaluate the test performance.
1.4 Preprocessing
en_receipe <-
recipe(log_volume ~ ., data = NYSE_other) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_normalize(all_numeric_predictors(), -all_outcomes()) %>%
step_naomit(all_predictors()) %>%
prep(data = NYSE_other)
en_receipe1.5 Model training
### Model
enet_mod <-
# mixture = 0 (ridge), mixture = 1 (lasso)
# mixture = (0, 1) elastic net
# As an example, we set mixture = 0.5. It needs to be tuned.
linear_reg(penalty = tune(), mixture = 0.5) %>%
set_engine("glmnet")
enet_modLinear Regression Model Specification (regression)
Main Arguments:
penalty = tune()
mixture = 0.5
Computational engine: glmnet
en_wf <-
workflow() %>%
add_model(enet_mod) %>%
add_recipe(en_receipe %>% step_rm(date) %>% step_indicate_na())
en_wf══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_dummy()
• step_normalize()
• step_naomit()
• step_rm()
• step_indicate_na()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Main Arguments:
penalty = tune()
mixture = 0.5
Computational engine: glmnet
folds <- NYSE_other %>% arrange(date) %>%
sliding_period(date, period = "month", lookback = Inf, assess_stop = 1)
# rolling_origin(initial = 5, assess = 1)
folds# Sliding period resampling
# A tibble: 204 × 2
splits id
<list> <chr>
1 <split [15/22]> Slice001
2 <split [37/19]> Slice002
3 <split [56/21]> Slice003
4 <split [77/21]> Slice004
5 <split [98/22]> Slice005
6 <split [120/20]> Slice006
7 <split [140/22]> Slice007
8 <split [162/22]> Slice008
9 <split [184/20]> Slice009
10 <split [204/23]> Slice010
# ℹ 194 more rows
month_folds <- NYSE_other %>%
sliding_period(
date,
"month",
lookback = Inf,
skip = 4)
month_folds# Sliding period resampling
# A tibble: 200 × 2
splits id
<list> <chr>
1 <split [98/22]> Slice001
2 <split [120/20]> Slice002
3 <split [140/22]> Slice003
4 <split [162/22]> Slice004
5 <split [184/20]> Slice005
6 <split [204/23]> Slice006
7 <split [227/18]> Slice007
8 <split [245/21]> Slice008
9 <split [266/22]> Slice009
10 <split [288/19]> Slice010
# ℹ 190 more rows
lambda_grid <-
grid_regular(penalty(range = c(-8, -7), trans = log10_trans()), levels = 3)
lambda_grid# A tibble: 3 × 1
penalty
<dbl>
1 0.00000001
2 0.0000000316
3 0.0000001
en_fit <- tune_grid(en_wf, resamples = month_folds, grid = lambda_grid)
en_fit# Tuning results
# Sliding period resampling
# A tibble: 200 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [98/22]> Slice001 <tibble [6 × 5]> <tibble [0 × 3]>
2 <split [120/20]> Slice002 <tibble [6 × 5]> <tibble [0 × 3]>
3 <split [140/22]> Slice003 <tibble [6 × 5]> <tibble [0 × 3]>
4 <split [162/22]> Slice004 <tibble [6 × 5]> <tibble [0 × 3]>
5 <split [184/20]> Slice005 <tibble [6 × 5]> <tibble [0 × 3]>
6 <split [204/23]> Slice006 <tibble [6 × 5]> <tibble [0 × 3]>
7 <split [227/18]> Slice007 <tibble [6 × 5]> <tibble [0 × 3]>
8 <split [245/21]> Slice008 <tibble [6 × 5]> <tibble [0 × 3]>
9 <split [266/22]> Slice009 <tibble [6 × 5]> <tibble [0 × 3]>
10 <split [288/19]> Slice010 <tibble [6 × 5]> <tibble [0 × 3]>
# ℹ 190 more rows
en_fit %>%
collect_metrics() %>%
print(width = Inf) %>%
filter(.metric == "rsq") %>%
ggplot(mapping = aes(x = penalty, y = mean)) +
geom_point() +
geom_line() +
labs(x = "Penalty", y = "CV RMSE") +
scale_x_log10(labels = scales::label_number())# A tibble: 6 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.00000001 rmse standard 0.133 200 0.00287 Preprocessor1_Model1
2 0.00000001 rsq standard 0.381 200 0.0149 Preprocessor1_Model1
3 0.0000000316 rmse standard 0.133 200 0.00287 Preprocessor1_Model2
4 0.0000000316 rsq standard 0.381 200 0.0149 Preprocessor1_Model2
5 0.0000001 rmse standard 0.133 200 0.00287 Preprocessor1_Model3
6 0.0000001 rsq standard 0.381 200 0.0149 Preprocessor1_Model3
en_fit %>%
show_best("rsq")# A tibble: 3 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.00000001 rsq standard 0.381 200 0.0149 Preprocessor1_Model1
2 0.0000000316 rsq standard 0.381 200 0.0149 Preprocessor1_Model2
3 0.0000001 rsq standard 0.381 200 0.0149 Preprocessor1_Model3
best_en_fit <- en_fit %>%
select_best("rsq")
best_en_fit# A tibble: 1 × 2
penalty .config
<dbl> <chr>
1 0.00000001 Preprocessor1_Model1
# Final workflow
final_wf_en <- en_wf %>%
finalize_workflow(best_en_fit)
final_wf_en══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_dummy()
• step_normalize()
• step_naomit()
• step_rm()
• step_indicate_na()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Main Arguments:
penalty = 1e-08
mixture = 0.5
Computational engine: glmnet
# Fit the whole training set, then predict the test cases
final_fit_en <-
final_wf_en %>%
last_fit(correct_split)
final_fit_en# Resampling results
# Manual resampling
# A tibble: 1 × 6
splits id .metrics .notes .predictions .workflow
<list> <chr> <list> <list> <list> <list>
1 <split [3207/1069]> train/test split <tibble> <tibble> <tibble> <workflow>
# Test metrics
final_fit_en %>% collect_metrics()# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 0.138 Preprocessor1_Model1
2 rsq standard 0.688 Preprocessor1_Model1
predictions <- final_fit_en %>%
collect_predictions()
predictions# A tibble: 1,069 × 5
id .pred .row log_volume .config
<chr> <dbl> <int> <dbl> <chr>
1 train/test split -0.135 3208 0.0353 Preprocessor1_Model1
2 train/test split -0.0388 3209 0.0338 Preprocessor1_Model1
3 train/test split -0.106 3210 -0.141 Preprocessor1_Model1
4 train/test split -0.0471 3211 -0.352 Preprocessor1_Model1
5 train/test split -0.160 3212 0.154 Preprocessor1_Model1
6 train/test split 0.0224 3213 -0.167 Preprocessor1_Model1
7 train/test split -0.116 3214 0.102 Preprocessor1_Model1
8 train/test split -0.0778 3215 -0.0838 Preprocessor1_Model1
9 train/test split -0.0669 3216 -0.247 Preprocessor1_Model1
10 train/test split -0.0560 3217 0.205 Preprocessor1_Model1
# ℹ 1,059 more rows
- Hint: Workflow: Lasso is a good starting point.
1.6 Random forest forecaster (30pts)
Use the same features as in AR(\(L\)) for the random forest. Tune the random forest and evaluate the test performance.
Hint: Workflow: Random Forest for Prediction is a good starting point.
rf_recipe <- recipe(log_volume ~ ., data = NYSE_other) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_normalize(all_numeric_predictors(), -all_outcomes()) %>%
step_naomit(all_predictors()) %>%
prep(data = NYSE_other)
rf_reciperf_mod <-
rand_forest(
mode = "regression",
# Number of predictors randomly sampled in each split
mtry = tune(),
# Number of trees in ensemble
trees = tune()
) %>%
set_engine("ranger")
rf_modRandom Forest Model Specification (regression)
Main Arguments:
mtry = tune()
trees = tune()
Computational engine: ranger
rf_wf <- workflow() %>%
add_recipe(rf_recipe %>% step_rm(date) %>% step_indicate_na()) %>%
add_model(rf_mod)
rf_wf══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_dummy()
• step_normalize()
• step_naomit()
• step_rm()
• step_indicate_na()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)
Main Arguments:
mtry = tune()
trees = tune()
Computational engine: ranger
folds <- NYSE_other %>% arrange(date) %>%
sliding_period(date, period = "month", lookback = Inf, assess_stop = 1)
# rolling_origin(initial = 5, assess = 1)
folds# Sliding period resampling
# A tibble: 204 × 2
splits id
<list> <chr>
1 <split [15/22]> Slice001
2 <split [37/19]> Slice002
3 <split [56/21]> Slice003
4 <split [77/21]> Slice004
5 <split [98/22]> Slice005
6 <split [120/20]> Slice006
7 <split [140/22]> Slice007
8 <split [162/22]> Slice008
9 <split [184/20]> Slice009
10 <split [204/23]> Slice010
# ℹ 194 more rows
month_folds <- NYSE_other %>%
sliding_period(
date,
"month",
lookback = Inf,
skip = 4)
month_folds# Sliding period resampling
# A tibble: 200 × 2
splits id
<list> <chr>
1 <split [98/22]> Slice001
2 <split [120/20]> Slice002
3 <split [140/22]> Slice003
4 <split [162/22]> Slice004
5 <split [184/20]> Slice005
6 <split [204/23]> Slice006
7 <split [227/18]> Slice007
8 <split [245/21]> Slice008
9 <split [266/22]> Slice009
10 <split [288/19]> Slice010
# ℹ 190 more rows
rf_grid <- grid_regular(
trees(range = c(100L, 300L)),
mtry(range = c(1L, 5L)),
levels = c(3, 5)
)
rf_grid# A tibble: 15 × 2
trees mtry
<int> <int>
1 100 1
2 200 1
3 300 1
4 100 2
5 200 2
6 300 2
7 100 3
8 200 3
9 300 3
10 100 4
11 200 4
12 300 4
13 100 5
14 200 5
15 300 5
rf_fit <- tune_grid(
rf_wf,
resamples = month_folds,
grid = rf_grid,
metrics = metric_set(rmse, rsq)
)
rf_fit# Tuning results
# Sliding period resampling
# A tibble: 200 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [98/22]> Slice001 <tibble [30 × 6]> <tibble [0 × 3]>
2 <split [120/20]> Slice002 <tibble [30 × 6]> <tibble [0 × 3]>
3 <split [140/22]> Slice003 <tibble [30 × 6]> <tibble [0 × 3]>
4 <split [162/22]> Slice004 <tibble [30 × 6]> <tibble [0 × 3]>
5 <split [184/20]> Slice005 <tibble [30 × 6]> <tibble [0 × 3]>
6 <split [204/23]> Slice006 <tibble [30 × 6]> <tibble [0 × 3]>
7 <split [227/18]> Slice007 <tibble [30 × 6]> <tibble [0 × 3]>
8 <split [245/21]> Slice008 <tibble [30 × 6]> <tibble [0 × 3]>
9 <split [266/22]> Slice009 <tibble [30 × 6]> <tibble [0 × 3]>
10 <split [288/19]> Slice010 <tibble [30 × 6]> <tibble [0 × 3]>
# ℹ 190 more rows
rf_fit %>%
collect_metrics() %>%
print(width = Inf) %>%
filter(.metric == "rsq") %>%
mutate(mtry = as.factor(mtry)) %>%
ggplot(mapping = aes(x = trees, y = mean, color = mtry)) +
# geom_point() +
geom_line() +
labs(x = "Num. of Trees", y = "CV mse")# A tibble: 30 × 8
mtry trees .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1 100 rmse standard 0.161 200 0.00385 Preprocessor1_Model01
2 1 100 rsq standard 0.312 200 0.0142 Preprocessor1_Model01
3 1 200 rmse standard 0.161 200 0.00388 Preprocessor1_Model02
4 1 200 rsq standard 0.312 200 0.0141 Preprocessor1_Model02
5 1 300 rmse standard 0.161 200 0.00389 Preprocessor1_Model03
6 1 300 rsq standard 0.318 200 0.0143 Preprocessor1_Model03
7 2 100 rmse standard 0.148 200 0.00330 Preprocessor1_Model04
8 2 100 rsq standard 0.322 200 0.0145 Preprocessor1_Model04
9 2 200 rmse standard 0.147 200 0.00334 Preprocessor1_Model05
10 2 200 rsq standard 0.331 200 0.0145 Preprocessor1_Model05
# ℹ 20 more rows
# Final workflow
rf_fit %>%
show_best("rsq")# A tibble: 5 × 8
mtry trees .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 300 rsq standard 0.340 200 0.0146 Preprocessor1_Model15
2 4 300 rsq standard 0.339 200 0.0146 Preprocessor1_Model12
3 3 200 rsq standard 0.336 200 0.0147 Preprocessor1_Model08
4 5 100 rsq standard 0.335 200 0.0149 Preprocessor1_Model13
5 5 200 rsq standard 0.335 200 0.0146 Preprocessor1_Model14
best_rf <- rf_fit %>%
select_best("rsq")
best_rf# A tibble: 1 × 3
mtry trees .config
<int> <int> <chr>
1 5 300 Preprocessor1_Model15
# Final workflow
final_wf_rf <- rf_wf %>%
finalize_workflow(best_rf)
final_wf_rf══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_dummy()
• step_normalize()
• step_naomit()
• step_rm()
• step_indicate_na()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)
Main Arguments:
mtry = 5
trees = 300
Computational engine: ranger
# Fit the whole training set, then predict the test cases
final_fit_rf <-
final_wf_rf %>%
last_fit(correct_split)
final_fit_rf# Resampling results
# Manual resampling
# A tibble: 1 × 6
splits id .metrics .notes .predictions .workflow
<list> <chr> <list> <list> <list> <list>
1 <split [3207/1069]> train/test split <tibble> <tibble> <tibble> <workflow>
# Test metrics
final_fit_rf %>%
collect_metrics()# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 0.150 Preprocessor1_Model1
2 rsq standard 0.652 Preprocessor1_Model1
1.7 Boosting forecaster (30pts)
Use the same features as in AR(\(L\)) for the boosting. Tune the boosting algorithm and evaluate the test performance.
Hint: Workflow: Boosting tree for Prediction is a good starting point.
boost_recipe <- recipe(log_volume ~ ., data = NYSE_other) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_normalize(all_numeric_predictors(), -all_outcomes()) %>%
step_naomit(all_predictors()) %>%
prep(data = NYSE_other)
boost_recipegb_mod <-
boost_tree(
mode = "regression",
trees = 1000,
tree_depth = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost")
gb_modBoosted Tree Model Specification (regression)
Main Arguments:
trees = 1000
tree_depth = tune()
learn_rate = tune()
Computational engine: xgboost
boost_workflow <- workflow() %>%
add_model(gb_mod) %>%
add_recipe(boost_recipe %>% step_rm(date) %>% step_indicate_na())
boost_workflow══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_dummy()
• step_normalize()
• step_naomit()
• step_rm()
• step_indicate_na()
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (regression)
Main Arguments:
trees = 1000
tree_depth = tune()
learn_rate = tune()
Computational engine: xgboost
folds <- NYSE_other %>% arrange(date) %>%
sliding_period(date, period = "month", lookback = Inf, assess_stop = 1)
# rolling_origin(initial = 5, assess = 1)
month_folds <- NYSE_other %>%
sliding_period(
date,
"month",
lookback = Inf,
skip = 4)boost_grid <- grid_regular(
tree_depth(range = c(1L, 2L)),
learn_rate(range = c(-3, -2), trans = log10_trans()),
levels = c(2, 3)
)
boost_grid# A tibble: 6 × 2
tree_depth learn_rate
<int> <dbl>
1 1 0.001
2 2 0.001
3 1 0.00316
4 2 0.00316
5 1 0.01
6 2 0.01
boost_fit <- tune_grid(
boost_workflow,
resamples = month_folds,
grid = boost_grid,
metrics = metric_set(rmse, rsq)
)
boost_fit# Tuning results
# Sliding period resampling
# A tibble: 200 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [98/22]> Slice001 <tibble [12 × 6]> <tibble [0 × 3]>
2 <split [120/20]> Slice002 <tibble [12 × 6]> <tibble [0 × 3]>
3 <split [140/22]> Slice003 <tibble [12 × 6]> <tibble [0 × 3]>
4 <split [162/22]> Slice004 <tibble [12 × 6]> <tibble [0 × 3]>
5 <split [184/20]> Slice005 <tibble [12 × 6]> <tibble [0 × 3]>
6 <split [204/23]> Slice006 <tibble [12 × 6]> <tibble [0 × 3]>
7 <split [227/18]> Slice007 <tibble [12 × 6]> <tibble [0 × 3]>
8 <split [245/21]> Slice008 <tibble [12 × 6]> <tibble [0 × 3]>
9 <split [266/22]> Slice009 <tibble [12 × 6]> <tibble [0 × 3]>
10 <split [288/19]> Slice010 <tibble [12 × 6]> <tibble [0 × 3]>
# ℹ 190 more rows
boost_fit %>%
collect_metrics() %>%
print(width = Inf) %>%
filter(.metric == "rsq") %>%
ggplot(mapping = aes(x = learn_rate, y = mean, color = factor(tree_depth))) +
geom_point() +
geom_line() +
labs(x = "Learning Rate", y = "CV AUC") +
scale_x_log10()# A tibble: 12 × 8
tree_depth learn_rate .metric .estimator mean n std_err
<int> <dbl> <chr> <chr> <dbl> <int> <dbl>
1 1 0.001 rmse standard 0.257 200 0.00562
2 1 0.001 rsq standard 0.203 200 0.0120
3 2 0.001 rmse standard 0.251 200 0.00502
4 2 0.001 rsq standard 0.239 200 0.0126
5 1 0.00316 rmse standard 0.160 200 0.00356
6 1 0.00316 rsq standard 0.253 200 0.0129
7 2 0.00316 rmse standard 0.150 200 0.00313
8 2 0.00316 rsq standard 0.303 200 0.0141
9 1 0.01 rmse standard 0.144 200 0.00307
10 1 0.01 rsq standard 0.322 200 0.0143
11 2 0.01 rmse standard 0.136 200 0.00283
12 2 0.01 rsq standard 0.369 200 0.0153
.config
<chr>
1 Preprocessor1_Model1
2 Preprocessor1_Model1
3 Preprocessor1_Model2
4 Preprocessor1_Model2
5 Preprocessor1_Model3
6 Preprocessor1_Model3
7 Preprocessor1_Model4
8 Preprocessor1_Model4
9 Preprocessor1_Model5
10 Preprocessor1_Model5
11 Preprocessor1_Model6
12 Preprocessor1_Model6
boost_fit %>%
show_best("rsq")# A tibble: 5 × 8
tree_depth learn_rate .metric .estimator mean n std_err .config
<int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 2 0.01 rsq standard 0.369 200 0.0153 Preprocessor1_Mo…
2 1 0.01 rsq standard 0.322 200 0.0143 Preprocessor1_Mo…
3 2 0.00316 rsq standard 0.303 200 0.0141 Preprocessor1_Mo…
4 1 0.00316 rsq standard 0.253 200 0.0129 Preprocessor1_Mo…
5 2 0.001 rsq standard 0.239 200 0.0126 Preprocessor1_Mo…
best_gb <- boost_fit %>%
select_best("rsq")
best_gb# A tibble: 1 × 3
tree_depth learn_rate .config
<int> <dbl> <chr>
1 2 0.01 Preprocessor1_Model6
# Final workflow
final_wf_boost <- boost_workflow %>%
finalize_workflow(best_gb)
final_wf_boost══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_dummy()
• step_normalize()
• step_naomit()
• step_rm()
• step_indicate_na()
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (regression)
Main Arguments:
trees = 1000
tree_depth = 2
learn_rate = 0.01
Computational engine: xgboost
# Fit the whole training set, then predict the test cases
final_fit_boost <-
final_wf_boost %>%
last_fit(correct_split)
final_fit_boost# Resampling results
# Manual resampling
# A tibble: 1 × 6
splits id .metrics .notes .predictions .workflow
<list> <chr> <list> <list> <list> <list>
1 <split [3207/1069]> train/test split <tibble> <tibble> <tibble> <workflow>
# Test metrics
final_fit_boost %>%
collect_metrics()# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 0.147 Preprocessor1_Model1
2 rsq standard 0.665 Preprocessor1_Model1
1.8 Summary (30pts)
Your score for this question is largely determined by your final test performance.
Summarize the performance of different machine learning forecasters in the following format.
The summary of the performance of various machine learning forecasters on predicting the daily Log trading volume using the New York Stock Exchange data from 1962 to 1986, with training data up to January 2, 1980, is presented below. The evaluation of model performance was based on the coefficient of determination (R^2), both on cross-validation (CV) and on the unseen test data. The R^2 metric measures the proportion of variance in the dependent variable that is predictable from the independent variables, with values closer to 1 indicating better model performance. | Method | CV \(R^2\) | Test \(R^2\) | |:——:|:——:|:——:|:——:| | Baseline | - |Straw man test R2: 0.35 | | AR(5) | 0.38 | 0.69 | | Random Forest | 0.34| 0.65| | Boosting | 0.37| 0.67| Baseline (Straw man): The baseline model, which uses the previous day’s log trading volume as the prediction for the current day, achieved a R^2 of 0.35 on the test set. This simple model serves as a point of comparison for more complex models.
AR(5): An autoregressive model of order 5 (AR(5)), which uses the past 5 days of log trading volume to predict the current day’s volume, showed an improvement over the baseline, with a CV R^2 of 0.38 and a test R^2 of 0.69. This indicates that incorporating more historical data points helps in improving the prediction accuracy.
Random Forest: The Random Forest model, which is a non-linear and more complex model compared to AR(5), achieved a slightly lower CV R^2 of 0.34 but also a lower test R^2 of 0.65. This suggests that despite its ability to capture complex patterns in the data, it might be overfitting to the training data or not capturing the time series dynamics as effectively as linear models in this specific case.
Boosting: The Boosting model, another sophisticated and flexible ensemble method, showed a CV R^2 of 0.37 and a test R^2 of 0.67. It performed better than the Random Forest model but was still slightly behind the AR(5) model in terms of test performance. Boosting’s iterative approach to correcting errors might be beneficial for this time series forecasting task, although it didn’t outperform the simpler AR(5) model.
Conclusion: The AR(5) model, despite its simplicity, provided the best performance on the test data among the models evaluated. This suggests that for the task of forecasting daily Log trading volume, a linear approach that directly incorporates recent historical values can be very effective. Meanwhile, the more complex Random Forest and Boosting models did not achieve higher performance, highlighting the importance of model selection and the possibility that simpler models may suffice for certain forecasting tasks. ### Extension reading
- MOIRAI: Salesforce’s Foundation Model for Time-Series Forecasting
2 ISL Exercise 12.6.13 (90 pts)
- On the book website, www.statlearning.com, there is a gene expres- sion data set (Ch12Ex13.csv) that consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.
- Load in the data using read.csv(). You will need to select header = F.
- Apply hierarchical clustering to the samples using correlation- based distance, and plot the dendrogram. Do the genes separate the samples into the two groups? Do your results depend on the type of linkage used?
- Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.
import data
library(tidyverse)
library(cluster)
library(ggplot2)
# Load the data
gene_express <- read.csv('../hw6/Ch12Ex13.csv', header = FALSE)
# Few top rows of the dataset:
head(gene_express) V1 V2 V3 V4 V5 V6
1 -0.96193340 0.4418028 -0.9750051 1.4175040 0.8188148 0.3162937
2 -0.29252570 -1.1392670 0.1958370 -1.2811210 -0.2514393 2.5119970
3 0.25878820 -0.9728448 0.5884858 -0.8002581 -1.8203980 -2.0589240
4 -1.15213200 -2.2131680 -0.8615249 0.6309253 0.9517719 -1.1657240
5 0.19578280 0.5933059 0.2829921 0.2471472 1.9786680 -0.8710180
6 0.03012394 -0.6910143 -0.4034258 -0.7298590 -0.3640986 1.1253490
V7 V8 V9 V10 V11 V12
1 -0.02496682 -0.06396600 0.03149702 -0.3503106 -0.7227299 -0.2819547
2 -0.92220620 0.05954277 -1.40964500 -0.6567122 -0.1157652 0.8259783
3 -0.06476437 1.59212400 -0.17311700 -0.1210874 -0.1875790 -1.5001630
4 -0.39155860 1.06361900 -0.35000900 -1.4890580 -0.2432189 -0.4330340
5 -0.98971500 -1.03225300 -1.10965400 -0.3851423 1.6509570 -1.7449090
6 -1.40404100 -0.80613040 -1.23792400 0.5776018 -0.2720642 2.1765620
V13 V14 V15 V16 V17 V18
1 1.33751500 0.70197980 1.0076160 -0.4653828 0.6385951 0.2867807
2 0.34644960 -0.56954860 -0.1315365 0.6902290 -0.9090382 1.3026420
3 -1.22873700 0.85598900 1.2498550 -0.8980815 0.8702058 -0.2252529
4 -0.03879128 -0.05789677 -1.3977620 -0.1561871 -2.7359820 0.7756169
5 -0.37888530 -0.67982610 -2.1315840 -0.2301718 0.4661243 -1.8004490
6 1.43640700 -1.02578100 0.2981582 -0.5559659 0.2046529 -1.1916480
V19 V20 V21 V22 V23 V24
1 -0.2270782 -0.22004520 -1.2425730 -0.1085056 -1.8642620 -0.5005122
2 -1.6726950 -0.52550400 0.7979700 -0.6897930 0.8995305 0.4285812
3 0.4502892 0.55144040 0.1462943 0.1297400 1.3042290 -1.6619080
4 0.6141562 2.01919400 1.0811390 -1.0766180 -0.2434181 0.5134822
5 0.6262904 -0.09772305 -0.2997108 -0.5295591 -2.0235670 -0.5108402
6 0.2350916 0.67096470 0.1307988 1.0689940 1.2309870 1.1344690
V25 V26 V27 V28 V29 V30
1 -1.32500800 1.06341100 -0.2963712 -0.1216457 0.08516605 0.62417640
2 -0.67611410 -0.53409490 -1.7325070 -1.6034470 -1.08362000 0.03342185
3 -1.63037600 -0.07742528 1.3061820 0.7926002 1.55946500 -0.68851160
4 -0.51285780 2.55167600 -2.3143010 -1.2764700 -1.22927100 1.43439600
5 0.04600274 1.26803000 -0.7439868 0.2231319 0.85846280 0.27472610
6 0.55636800 -0.35876640 1.0798650 -0.2064905 -0.00616453 0.16425470
V31 V32 V33 V34 V35 V36
1 -0.5095915 -0.216725500 -0.05550597 -0.4844491 -0.5215811 1.9491350
2 1.7007080 0.007289556 0.09906234 0.5638533 -0.2572752 -0.5817805
3 -0.6154720 0.009999363 0.94581000 -0.3185212 -0.1178895 0.6213662
4 -0.2842774 0.198945600 -0.09183320 0.3496279 -0.2989097 1.5136960
5 -0.6929984 -0.845707200 -0.17749680 -0.1664908 1.4831550 -1.6879460
6 1.1567370 0.241774500 0.08863952 0.1829540 0.9426771 -0.2096004
V37 V38 V39 V40
1 1.32433500 0.4681471 1.06110000 1.6559700
2 -0.16988710 -0.5423036 0.31293890 -1.2843770
3 -0.07076396 0.4016818 -0.01622713 -0.5265532
4 0.67118470 0.0108553 -1.04368900 1.6252750
5 -0.14142960 0.2007785 -0.67594210 2.2206110
6 0.53626210 -1.1852260 -0.42274760 0.6243603
2.1 12.6.13 (b) (30 pts)
There are several methods of “dissimilarity” (correlation based distance) which can be explored for creation of distance matrix.
I am using the following method to create a distance matrix: Dissimilarity = 1 - Correlation the as.dist() function is used here to assign the correlation values to be “distances” Correlation-based distance can be computed using the as.dist() function, which converts an arbitrary square symmetric matrix into a form that the hclust() function recognizes as a distance matrix. The hclust() function implements hierarchical clustering in R.
We now plot the hierarchical clustering dendrogram using complete, single, and average linkage clustering using correlation based distance.
dd=as.dist(1- cor(gene_express))
# To plot the dendrograms obtained using the usual plot() function.
# The numbers at the bottom of the plot identify each observation.
plot(hclust(dd, method ="complete"), main=" Complete Linkage
with Correlation - Based Distance ", xlab="", sub ="")plot(hclust(dd, method ="average"), main=" Average Linkage
with Correlation - Based Distance ", xlab="", sub ="")plot(hclust(dd, method ="single"), main=" Single Linkage
with Correlation - Based Distance ", xlab="", sub ="")Inference:
The choice of linkage certainly does affect the results obtained. As we see, we obtain two clusters for complete and single linkages and three clusters for average linkage Typically, single linkage will tend to yield trailing clusters: very large clusters onto which individual observations attach one-by-one. + On the other hand, complete and average linkage tend to yield more balanced, attractive clusters. For this reason, complete and average linkage are generally preferred to single linkage.
2.2 PCA and UMAP (30 pts)
library(FactoMineR)
library(factoextra)
# Applying PCA
pca_res <- prcomp(t(gene_express), scale. = TRUE)
# Plotting PCA results
fviz_pca_ind(pca_res,
col.ind = rep(c("Healthy", "Diseased"), each = 20), # Color by group
legend.title = "Group",
title = "PCA of Gene Expression Data")UMAP
library(umap)
# Preparing data for UMAP
data_for_umap <- t(gene_express) # Transposing the data may be necessary depending on the function
# Applying UMAP
umap_res <- umap(data_for_umap)
# Plotting UMAP results
df_umap <- as.data.frame(umap_res$layout)
df_umap$Group <- rep(c("Healthy", "Diseased"), each = 20)
ggplot(df_umap, aes(V1, V2, color = Group)) +
geom_point() +
theme_minimal() +
labs(title = "UMAP of Gene Expression Data",
x = "UMAP 1", y = "UMAP 2", color = "Group")2.3 12.6.13 (c) (30 pts)
To analyze the gene expression data and identify which genes exhibit the most significant differences between two groups (healthy and diseased), we will employ the prcomp() function for Principal Component Analysis (PCA). This technique reduces the dimensionality while preserving as much of the variation as possible. By setting scale = TRUE, we ensure that each variable is not only centered to have a mean of zero but is also scaled to have a unit standard deviation. This standardization is crucial for comparing features that may have different units or scales of measurement.
The output of prcomp(), specifically the rotation component, contains the loadings of the principal components. These loadings can be interpreted as the contribution or weight of each gene to the principal components. Essentially, they help in understanding how strongly each gene is associated with the different dimensions (principal components) extracted during PCA. In this context, examining the loadings can highlight the genes that differ most significantly between the healthy and diseased groups, as indicated by their contribution to the principal axes of variation within the data.
pr.out <- prcomp(t(gene_express), scale = TRUE)
head(pr.out$rotation) PC1 PC2 PC3 PC4 PC5
[1,] -0.005248643 0.02924652 -0.003735358 -0.016526153 0.006318053
[2,] -0.002162728 0.04373711 -0.071475379 0.036521467 -0.010739305
[3,] 0.014444163 0.01361593 -0.011068681 -0.075011306 0.002396073
[4,] 0.014868417 -0.01975169 -0.038690207 0.018872555 -0.007595757
[5,] 0.010600362 -0.02949636 0.016983813 0.005730388 -0.019518632
[6,] 0.030133863 0.03181327 0.016420667 0.016799619 0.037829117
PC6 PC7 PC8 PC9 PC10
[1,] -0.049838208 -0.01283006 0.0215726238 0.007640568 -0.042478511
[2,] 0.044918065 0.02817240 -0.0372224597 -0.030437604 0.077891794
[3,] -0.073571187 -0.03590103 0.0385994243 0.013067476 0.004386205
[4,] -0.006314036 0.04078925 0.0032669777 0.013650293 -0.003238640
[5,] 0.025142879 0.03502170 0.0357811899 0.036519007 -0.078427052
[6,] 0.076979558 0.01745551 0.0002496599 -0.037883486 0.001399260
PC11 PC12 PC13 PC14 PC15
[1,] -0.06799323 -0.066325536 0.049657614 -0.0105490714 0.002388065
[2,] 0.03504599 0.007576127 -0.013509593 -0.0120626555 0.017620628
[3,] 0.02675494 -0.004421772 -0.023358127 -0.0002795565 0.005085890
[4,] -0.04521873 -0.084699633 -0.002718848 -0.0170544115 -0.047018603
[5,] -0.06127074 -0.004607158 0.021907325 -0.0174513795 -0.014291851
[6,] 0.06333435 0.001071488 0.031586824 0.0091279070 0.012496721
PC16 PC17 PC18 PC19 PC20
[1,] 0.057226217 0.005064164 -0.0368384794 -0.01386445 0.008475267
[2,] -0.038753365 0.018177190 0.0065293541 0.02616184 -0.008769863
[3,] -0.008360650 -0.033206319 0.0036294280 -0.02014200 -0.024457400
[4,] -0.027114939 -0.037841235 0.0090504436 0.03994846 0.014224876
[5,] -0.010468740 -0.027149195 0.0487582488 0.02195479 0.060265982
[6,] 0.004923492 -0.003525826 0.0006219705 0.03129817 0.023261190
PC21 PC22 PC23 PC24 PC25
[1,] 0.0194249905 0.000140247 -0.02989553 -0.076029645 -0.004324215
[2,] -0.0078725765 -0.026677304 -0.03100535 -0.002287257 -0.021849136
[3,] -0.0093182521 0.024043186 0.05819660 0.018947343 0.014286045
[4,] 0.0554784102 0.059728272 -0.04416882 0.030836222 0.002601610
[5,] -0.0000414575 -0.039202866 0.01982787 -0.037371354 0.003398134
[6,] -0.0105442468 0.006799149 -0.06354254 0.019801778 0.007791356
PC26 PC27 PC28 PC29 PC30
[1,] -0.005802548 0.053495473 0.03324798 0.039595474 0.0008190378
[2,] -0.018643636 0.051218992 0.01319940 -0.005557906 -0.0085837033
[3,] -0.012192009 -0.083097382 0.02438412 0.012882579 -0.0032415950
[4,] 0.018058462 0.041462682 0.05223420 0.003324611 0.0449676174
[5,] -0.027906713 0.014694086 -0.01277924 0.001121646 0.0203340545
[6,] -0.053747211 0.008386746 -0.01655763 -0.005120959 -0.0673679957
PC31 PC32 PC33 PC34 PC35
[1,] 0.04451014 0.0001689378 -0.025423202 -0.010178476 0.021477278
[2,] 0.06072757 -0.0089395354 0.042380402 0.005784563 0.002459284
[3,] -0.02798607 -0.0406599481 -0.004561497 0.077879450 -0.005489991
[4,] 0.02768043 -0.0086310349 0.014854858 -0.006997894 -0.023272536
[5,] -0.06631350 -0.0079579383 0.033064563 -0.017950552 -0.050971271
[6,] -0.01807539 -0.0222340328 0.016882833 0.005770649 0.067631690
PC36 PC37 PC38 PC39 PC40
[1,] -0.019328540 0.027045885 -0.01467986 -0.00732158 0.05082225
[2,] 0.042550057 0.004536028 0.01595791 0.02267139 0.33237431
[3,] 0.017625830 0.003143908 -0.02153624 -0.01162287 -0.19231512
[4,] 0.015260530 -0.009337707 -0.03368954 0.01920661 0.17195042
[5,] -0.038077959 -0.003937376 0.02531928 -0.02512450 -0.11504928
[6,] 0.008485763 -0.044439490 -0.01276119 -0.01742770 0.14202453
gene.load <- apply(pr.out$rotation, 1, sum)
gene.differ <- order(abs(gene.load), decreasing = TRUE)
#the top most different genes across the two groups
gene.differ[1:20] [1] 2 755 889 676 960 475 907 174 878 840 374 281 673 716 914 370 567 394 984
[20] 28
# Assuming the first 20 samples are healthy and the next 20 are diseased
healthy <- gene_express[1:20, ]
diseased <- gene_express[21:40, ]
# Applying t-test for each gene and storing p-values
p_values <- apply(gene_express, 2, function(gene) t.test(gene[1:20], gene[21:40])$p.value)
# Adjusting for multiple testing (e.g., Bonferroni correction)
p_adjusted <- p.adjust(p_values, method = "bonferroni")
# Identifying genes with significant differences
significant_genes <- which(p_adjusted < 0.05)
# Print significant genes
print(significant_genes)V29 V39 V40
29 39 40
The PCA approach gives you genes that are important for the variance in the dataset but does not directly tell you about the difference between healthy and diseased samples. The t-test approach provides a direct answer to which genes have significantly different expressions between the two groups. Therefore, if your collaborator’s question is specifically about which genes differ the most between healthy and diseased groups, the t-test approach provides the direct answer: genes 29, 39, and 40 have significantly different expressions between the two groups, as indicated by the adjusted p-values being below 0.05.